Abstract

A random variable is sampled from a discrete distribution. The missing mass is the 
probability of the set of points not observed in the sample. We sharpen and simplify 
McAllester and Ortiz's results (JMLR, 2003) bounding the probability of large deviations
of the missing mass. Along the way, we refine and rigorously prove a fundamental
inequality of Kearns and Saul (UAI, 1998). 

1 Introduction 

Hoeffding's classic inequality [3] states that if $X$ is an $[a,b]$-valued random variable with $\mathbb{E}X = 0$,
then

$$\mathbb{E}e^{tX} \le e^{(b-a)^2 t^2/8}, \qquad t \ge 0. \tag{1}$$

A standard proof of (1) proceeds by writing each $x \in [a,b]$ as a convex combination of $a$ and $b$,
and using convexity of $x \mapsto e^{tx}$ together with $\mathbb{E}X = 0$ to obtain, for $p = -a/(b-a)$,

$$\mathbb{E}e^{tX/(b-a)} \le (1-p)e^{-tp} + pe^{t(1-p)} =: f(t) \le e^{t^2/8}, \tag{2}$$

where the last inequality follows by noticing that $\log f(0) = [\log f(t)]'\big|_{t=0} = 0$ and that
$[\log f(t)]'' \le 1/4$.

Although (1) is tight, it is a "worst-case" bound over all distributions with the given
support. Refinements of (1) include the Bernstein and Bennett inequalities [5], which take
the variance into account, but these are also too crude for some purposes.

In 1998, Kearns and Saul [1] put forth an exquisitely delicate inequality for (generalized) 
Bernoulli random variables, which is sensitive to the underlying distribution: 



$$(1-p)e^{-tp} + pe^{t(1-p)} \le \exp\left(\frac{1-2p}{4\log((1-p)/p)}\,t^2\right), \qquad p \in [0,1],\ t \in \mathbb{R}. \tag{3}$$



One easily verifies that (3) is superior to (1), except for $p = 1/2$, where the two coincide. In
fact, (3) is optimal in the sense that, for every $p$, there is a $t$ for which equality is achieved. The
Kearns-Saul inequality allows one to analyze various inference algorithms in neural networks,
and the influential paper [1] has inspired a fruitful line of research [U d EJ [TO] . 
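The comparison between (3) and (1) can be checked numerically. The following is a minimal sketch, not part of the original argument: the helper names `ks_coeff`, `bernoulli_mgf` and `holds_on_grid`, the grid, and the floating-point tolerance are all illustrative assumptions.

```python
import math

def ks_coeff(p):
    """Coefficient (1 - 2p) / (4 log((1 - p)/p)) from (3); its limit at p = 1/2 is 1/8."""
    if p == 0.5:
        return 0.125
    return (1.0 - 2.0 * p) / (4.0 * math.log((1.0 - p) / p))

def bernoulli_mgf(p, t):
    """Left-hand side of (3)."""
    return (1.0 - p) * math.exp(-t * p) + p * math.exp(t * (1.0 - p))

def holds_on_grid(ps, ts, rel_tol=1e-12):
    """True if (3) holds at every grid point, up to a small floating-point slack."""
    return all(
        bernoulli_mgf(p, t) <= math.exp(ks_coeff(p) * t * t) * (1.0 + rel_tol)
        for p in ps for t in ts
    )

ps = [i / 100 for i in range(1, 100)]      # p = 0.01, ..., 0.99
ts = [k / 10 for k in range(-100, 101)]    # t = -10.0, ..., 10.0
print(holds_on_grid(ps, ts))
# The Kearns-Saul coefficient never exceeds Hoeffding's 1/8 on this grid.
print(max(ks_coeff(p) for p in ps) <= 0.125)
```

A grid check of this kind is only a sanity test; it does not replace the proof given in Theorem 4.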

One specific application of the Kearns-Saul inequality involves the concentration of the 
missing mass. Let $p = (p_1, p_2, \ldots)$ be a distribution over $\mathbb{N}$ and suppose that $X_1, X_2, \ldots, X_n$
are sampled iid according to $p$. Define the indicator variable $\xi_j$ to be $0$ if $j$ occurs in the
sample and $1$ otherwise:

$$\xi_j = \mathbf{1}_{\{j \notin \{X_1, \ldots, X_n\}\}}, \qquad j \in \mathbb{N}.$$

The missing mass is the random variable

$$U_n = \sum_{j \in \mathbb{N}} p_j \xi_j. \tag{4}$$
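To make the definition concrete, here is a small sketch that computes the missing mass of one sample and its expectation. It assumes a finitely supported distribution given as a Python list (the text allows any distribution over $\mathbb{N}$), and the function names are illustrative.

```python
import random

def missing_mass(p, sample):
    """U_n: total probability of the symbols that do not appear in `sample`."""
    seen = set(sample)
    return sum(pj for j, pj in enumerate(p) if j not in seen)

def expected_missing_mass(p, n):
    """E U_n = sum_j p_j (1 - p_j)^n, since symbol j is missed with probability (1 - p_j)^n."""
    return sum(pj * (1.0 - pj) ** n for pj in p)

rng = random.Random(0)                      # fixed seed, for reproducibility only
p = [0.5, 0.25, 0.125, 0.125]               # assumed toy distribution on {0, 1, 2, 3}
n = 5
sample = rng.choices(range(len(p)), weights=p, k=n)
print(sample, missing_mass(p, sample), expected_missing_mass(p, n))
```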

McAllester and Schapire [8] first established subgaussian concentration for the missing mass
via a somewhat intricate argument. Later, McAllester and Ortiz [7] showed how the standard
inequalities of Hoeffding, Angluin-Valiant, Bernstein and Bennett are inadequate for obtaining
exponential bounds of the correct order in $n$, and developed a thermodynamic approach for
systematically handling this problem.

We were led to the Kearns-Saul inequality (3) in an attempt to understand and simplify
the missing mass concentration results of McAllester and Ortiz [7], some of which rely on (3).
However, we were unable to complete the proof of (3) sketched in [3], and a literature search
likewise came up empty. The proof we give here follows an alternate path, and may be of
independent interest. As an application, we simplify and sharpen some of the missing mass
concentration results given in [U [7] . 



2 Main results 

In [1, Lemma 1], Kearns and Saul define the function

$$g(t) = \frac{1}{t^2}\log\left[(1-p)e^{-tp} + pe^{t(1-p)}\right], \qquad t \in \mathbb{R}. \tag{5}$$



A natural attempt to find the maximum of $g$ leads one to the transcendental equation

$$g'(t) = \frac{(e^t-1)(1-p)pt - 2(1+(e^t-1)p)\log\left[(1+(e^t-1)p)e^{-pt}\right]}{(1+(e^t-1)p)\,t^3} = 0.$$

In an inspired tour de force, Kearns and Saul were able to find that $g'(t^*) = 0$ for

$$t^* = 2\log\frac{1-p}{p}.$$

This observation naturally suggests (i) arguing that t* is the unique zero of g' and (ii) supplying 
(perhaps via second-order information) an argument for t* being a local maximum. In fact, 
all evidence points to $g'(t)$ having the following properties:

(*) $g' > 0$ on $(-\infty, t^*)$,

(**) $g' = 0$ at $t = t^*$,

(***) $g' < 0$ on $(t^*, \infty)$.



(Footnote to the thermodynamic approach of McAllester and Ortiz [7]: the latter has, in turn, inspired a general thermodynamic approach to concentration [6].)
Unfortunately, besides straightforwardly verifying (**), we were not able to formally establish
(*) or (***), and we leave this as an intriguing open problem. Instead, in Theorem 4 we
prove the Kearns-Saul inequality (3) via a rather different approach. Moreover, for $p \ge 1/2$
and $t \ge 0$, the right-hand side of (3) may be improved to $\exp[p(1-p)t^2/2]$. This refinement,
proved in Lemma 5, may be of independent interest.
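Property (**) and the value of $g$ at $t^*$ are easy to confirm in floating point. The sketch below is illustrative only: the function names are assumptions, $g$ is rewritten using $(1-p)e^{-tp} + pe^{t(1-p)} = e^{-tp}(1 + (e^t-1)p)$, and the sign checks at two sample points are numerical evidence, not a proof of (*) or (***).

```python
import math

def g(p, t):
    """g(t) from (5), written as (log(1 + (e^t - 1) p) - p t) / t^2; requires t != 0."""
    return (math.log1p(math.expm1(t) * p) - p * t) / (t * t)

def g_prime(p, t):
    """Closed form of g'(t) from the transcendental equation; requires t != 0."""
    a = 1.0 + math.expm1(t) * p
    num = math.expm1(t) * (1.0 - p) * p * t - 2.0 * a * (math.log(a) - p * t)
    return num / (a * t ** 3)

def t_star(p):
    """Critical point t* = 2 log((1 - p)/p) found by Kearns and Saul."""
    return 2.0 * math.log((1.0 - p) / p)

for p in (0.1, 0.3, 0.7):
    ts = t_star(p)
    coeff = (1.0 - 2.0 * p) / (4.0 * math.log((1.0 - p) / p))
    # g'(t*) should vanish and g(t*) should equal the coefficient in (3).
    print(p, g_prime(p, ts), g(p, ts) - coeff)

# Sign of g' slightly to the left and right of t* for p = 0.3.
print(g_prime(0.3, t_star(0.3) - 0.5), g_prime(0.3, t_star(0.3) + 0.5))
```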

As an application, we recover the upper tail estimate on the missing mass in [7, Theorem 16]:

Theorem 1.

$$\mathbb{P}(U_n > \mathbb{E}U_n + \varepsilon) \le e^{-n\varepsilon^2}.$$

We also obtain the following lower tail estimate:

Theorem 2.

$$\mathbb{P}(U_n < \mathbb{E}U_n - \varepsilon) \le e^{-C_0 n\varepsilon^2/4},$$

where

$$C_0 = \inf_{0<x<1/2} \frac{2}{x(1-x)\log(1/x)} \approx 7.6821.$$

Since $C_0/4 \approx 1.92$, Theorem 2 sharpens the estimate in [7, Theorem 10], where the
constant in the exponent was $e/2 \approx 1.36$. Our bounds are arguably simpler than those in [7]
as they bypass the thermodynamic approach.
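For a tiny alphabet, both tail bounds can be compared against exact probabilities by brute-force enumeration. This is a sketch under stated assumptions: a three-symbol toy distribution, $n = 6$, the rounded constant $C_0 \approx 7.6821$, and a hypothetical helper `exact_tails`; it illustrates the statements and is not how the theorems are proved.

```python
import itertools
import math

C0 = 7.6821  # rounded numerical constant quoted in Theorem 2

def exact_tails(p, n, eps):
    """Exact P(U_n > E U_n + eps) and P(U_n < E U_n - eps) by enumerating all len(p)**n samples."""
    mean = sum(pj * (1.0 - pj) ** n for pj in p)
    upper = lower = 0.0
    for sample in itertools.product(range(len(p)), repeat=n):
        prob = math.prod(p[i] for i in sample)          # probability of this iid sample
        seen = set(sample)
        u = sum(pj for j, pj in enumerate(p) if j not in seen)
        if u > mean + eps:
            upper += prob
        if u < mean - eps:
            lower += prob
    return upper, lower

p, n = [0.6, 0.3, 0.1], 6                                # assumed toy example
for eps in (0.05, 0.1, 0.2):
    upper, lower = exact_tails(p, n, eps)
    print(eps, upper, math.exp(-n * eps ** 2), lower, math.exp(-C0 * n * eps ** 2 / 4))
```

Enumeration costs `len(p)**n` steps, so it is only feasible for very small examples like this one.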



3 Proofs 

The following well-known estimate is an immediate consequence of (2):
Lemma 3.

$$\tfrac12 e^{-t} + \tfrac12 e^{t} = \cosh t \le e^{t^2/2}, \qquad t \in \mathbb{R}.$$

We proceed with a proof of the Kearns-Saul inequality. 
Theorem 4. For all $p \in [0,1]$ and $t \in \mathbb{R}$,

$$(1-p)e^{-tp} + pe^{t(1-p)} \le \exp\left(\frac{1-2p}{4\log((1-p)/p)}\,t^2\right). \tag{6}$$

Proof. The cases $p = 0, 1$ are trivial. Since

$$\lim_{p\to 1/2} \frac{1-2p}{\log((1-p)/p)} = \frac12,$$

for $p = 1/2$ the claim follows from Lemma 3.

For $p \ne 1/2$, we multiply both sides of (6) by $e^{tp}$, take logarithms, and put $t = 2s\log((1-p)/p)$
to obtain the equivalent inequality

$$s(s + 2p(1-s))\log((1-p)/p) - \log\left(1 - p + p((1-p)/p)^{2s}\right) \ge 0. \tag{7}$$
For $s \in \mathbb{R}$, denote the left-hand side of (7) by $h_s(p)$. A routine calculation yields

$$h_s(1/2) = h_s'(1/2) = 0 \tag{8}$$

and

$$h_s''(p) = \left(\frac{(\mu-1)p^2 - s + p(1 - \mu + s + \mu s)}{p(1-p)(1+(\mu-1)p)}\right)^2,$$

where $\mu = ((1-p)/p)^{2s}$.

As $h_s'' \ge 0$, we have that $h_s$ is convex, and from (8) it follows that $h_s(p) \ge 0$ for all $s, p$. □

We will also need a refinement of (6):

Lemma 5. For $p \in [1/2, 1]$ and $t \ge 0$,

$$(1-p)e^{-tp} + pe^{t(1-p)} \le \exp\left(\frac{p(1-p)t^2}{2}\right). \tag{9}$$

Remark: Since the right-hand side of (6) majorizes the right-hand side of (9) uniformly
over $[1/2, 1]$, the latter estimate is tighter.

Proof. The claim is equivalent to

$$L(p) := (1-p) + pe^t \le \exp\left(pt + p(1-p)t^2/2\right) =: R(p), \qquad t \ge 0.$$

For $t \ge 4$, we have (using $2p \ge 1$ in the second inequality)

$$\begin{aligned}
\exp(pt + p(1-p)t^2/2) &\ge \exp(pt + p(1-p)(4t)/2) \\
&= \exp(pt + 2p(1-p)t) \\
&\ge \exp(pt + (1-p)t) = e^t \ge L(p).
\end{aligned}$$

For $0 \le t < 4$,

$$R''(p) - L''(p) = \tfrac14\exp\left(pt(2 + t - pt)/2\right)(2p-1)\,t^3\,((2p-1)t - 4), \qquad p \in [1/2, 1],$$

which is obviously non-positive. Now the inequality clearly holds at $p = 1$ (as equality), and
the $p = 1/2$ case is implied by Lemma 3. Since $R - L$ is concave in $p$ on $[1/2, 1]$ and non-negative at both endpoints, the claim follows.

□ 
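Lemma 5 and the accompanying remark can be sanity-checked on a grid. The sketch below is illustrative: the helper names, the grid over $p \in [0.5, 0.99]$ and $t \in [0, 10]$, and the floating-point tolerance are assumptions.

```python
import math

def bernoulli_mgf(p, t):
    """Left-hand side of (9)."""
    return (1.0 - p) * math.exp(-t * p) + p * math.exp(t * (1.0 - p))

def lemma5_bound(p, t):
    """Right-hand side of (9)."""
    return math.exp(p * (1.0 - p) * t * t / 2.0)

def ks_bound(p, t):
    """Right-hand side of (6), with the limiting coefficient 1/8 at p = 1/2."""
    coeff = 0.125 if p == 0.5 else (1.0 - 2.0 * p) / (4.0 * math.log((1.0 - p) / p))
    return math.exp(coeff * t * t)

def check(ps, ts, rel_tol=1e-12):
    """True if mgf <= bound (9) <= bound (6) at every grid point, up to rounding slack."""
    slack = 1.0 + rel_tol
    return all(
        bernoulli_mgf(p, t) <= lemma5_bound(p, t) * slack
        and lemma5_bound(p, t) <= ks_bound(p, t) * slack
        for p in ps for t in ts
    )

ps = [i / 100 for i in range(50, 100)]   # p = 0.50, ..., 0.99
ts = [k / 10 for k in range(0, 101)]     # t = 0.0, ..., 10.0
print(check(ps, ts))
```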

Our numerical constants are defined in the following lemma, whose elementary proof is 
omitted: 

Lemma 6. Define the function

$$f(x) = x(1-x)\log(1/x), \qquad x \in (0, 1/2).$$

Then $x_0 \approx 0.2356$ is the unique solution of $f'(x) = 0$ on $(0, 1/2)$. Furthermore,

$$C_0 := \inf_{0<x<1/2} 2/f(x) = 2/f(x_0) \approx 7.6821. \tag{10}$$
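The constants in Lemma 6 can be reproduced with a few lines of code. This is a minimal sketch using plain bisection on $f'$; the function names, the bracketing interval and the iteration count are illustrative choices.

```python
import math

def f(x):
    """f(x) = x (1 - x) log(1/x)."""
    return x * (1.0 - x) * math.log(1.0 / x)

def f_prime(x):
    """f'(x) = (1 - 2x) log(1/x) - (1 - x)."""
    return (1.0 - 2.0 * x) * math.log(1.0 / x) - (1.0 - x)

def bisect_root(fn, lo, hi, iters=200):
    """Bisection for a root of fn on [lo, hi], assuming fn(lo) > 0 > fn(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fn(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x0 = bisect_root(f_prime, 1e-9, 0.5)   # f' is positive near 0 and negative at 1/2
C0 = 2.0 / f(x0)
print(x0, C0)
```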
The main technical step towards obtaining our missing mass deviation estimates is the
following lemma. 

Lemma 7. Let $n \ge 1$, $\lambda \ge 0$, $p \in [0,1]$, and put $q = (1-p)^n$. Then:
(a)

$$qe^{\lambda(p - pq)} + (1-q)e^{-\lambda pq} \le \exp(p\lambda^2/4n).$$

(b)

$$qe^{\lambda(pq - p)} + (1-q)e^{\lambda pq} \le \exp(p\lambda^2/C_0 n).$$
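Proof. (a) We invoke Theorem 4 with $q$ in place of $p$ and $t = \lambda p$ to obtain

$$qe^{\lambda(p-pq)} + (1-q)e^{-\lambda pq} \le \exp\left[\frac{(1-2q)\lambda^2p^2}{4\log[(1-q)/q]}\right].$$

Thus it suffices to show that

$$\frac{(1-2q)\lambda^2p^2}{4\log[(1-q)/q]} \le \frac{p\lambda^2}{4n},$$

or equivalently, since $q = (1-p)^n$ gives $1/n = \log(1-p)/\log q$,

$$\frac{(1-2q)p}{\log[(1-q)/q]} \le \frac{\log(1-p)}{\log q}, \qquad p, q \in [0,1].$$

Collecting the $p$ and $q$ terms on opposite sides, it remains to prove that

$$L(q) := \frac{(1-2q)\log(1/q)}{\log[(1-q)/q]} \le \frac{\log(1/(1-p))}{p} =: R(p), \qquad 0 < p, q < 1.$$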

We claim that $L \le 1 \le R$. The second inequality is obvious from the Taylor expansion,
since

$$\frac{\log(1/(1-p))}{p} = 1 + p/2 + p^2/3 + p^3/4 + \cdots. \tag{11}$$

To prove that $L \le 1$, we note first that $L(q) \ge L(1-q)$ for $q \in (0, 1/2)$. Hence, it
suffices to consider $q \in (0, 1/2)$. To this end, it suffices to show that the function

$$f(q) = \log[(1-q)/q] - (1-2q)\log(1/q)$$

is positive on $(0, 1/2)$. Since $\lim_{q\to 0} f(q) = 0 = f(1/2)$ and

$$f''(q) = \frac{-2 + 3q - 2q^2}{(1-q)^2 q} < 0,$$

it follows that $f \ge 0$ on $[0, 1/2]$.
(b) The inequality is equivalent to

$$L(\lambda) := \frac{1}{\lambda^2p^2}\log\left[qe^{-\lambda p(1-q)} + (1-q)e^{\lambda pq}\right] \le \frac{1}{\lambda^2p^2}\cdot\frac{\lambda^2 p}{C_0\log q/\log(1-p)} =: R,$$

where $L$ is obtained from the left-hand side of (6) after replacing $p$ by $1-q$ and $t$ by $\lambda p$.
We analyze the cases $q < 1/2$ and $q > 1/2$ separately (as above, the case where $q = 1/2$
is trivial). For $q > 1/2$, put $\lambda^* = \frac{2}{p}\log\frac{q}{1-q}$ and invoke Theorem 4 to conclude that
$\sup_{\lambda\ge 0} L(\lambda) \le L(\lambda^*)$. Hence, it remains to prove that $L(\lambda^*) \le R$, or equivalently,

$$\frac{2q-1}{4\log(q/(1-q))} \le \frac{1}{\lambda^2p^2}\cdot\frac{\lambda^2 p}{C_0\log q/\log(1-p)}.$$

After simplifying, this amounts to showing that

$$4\cdot\frac{\log(1/(1-p))}{p}\cdot\frac{\log(q/(1-q))}{(2q-1)\log(1/q)} \ge C_0.$$



As in (11), the factor $\log[1/(1-p)]/p$ is bounded below by 1. We claim that the factor
$\frac{\log(q/(1-q))}{(2q-1)\log(1/q)}$ increases for $q \in [1/2, 1]$. Indeed, this is obvious for $1/\log(1/q)$, and the
expansion about $q = 1/2$

$$\frac{\log(q/(1-q))}{2q-1} = \sum_{n=0}^{\infty}\frac{2^{2n+1}}{2n+1}\left(q - \frac12\right)^{2n}$$

shows that the same holds for $\frac{\log(q/(1-q))}{2q-1}$. In particular,

$$4\cdot\frac{\log(1/(1-p))}{p}\cdot\frac{\log(q/(1-q))}{(2q-1)\log(1/q)} \ge 4\cdot 1\cdot\lim_{q\to 1/2}\frac{\log(q/(1-q))}{(2q-1)\log(1/q)} = 8/\log 2 \approx 11.542 > C_0.$$

When $q < 1/2$, we invoke Lemma 5 together with the observation that

$$\lim_{\lambda\to 0} L(\lambda) = \tfrac12 q(1-q)$$

to conclude that $\sup_{\lambda\ge 0} L(\lambda) \le L(0)$. Hence, it remains to show that

$$\frac{\log(1/(1-p))}{p}\cdot\frac{2}{q(1-q)\log(1/q)} \ge C_0.$$

As in (11), $\log[1/(1-p)]/p \ge 1$ and the claim follows by Lemma 6.



□ 
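Both parts of Lemma 7 can be spot-checked numerically. The following sketch evaluates the two sides on an assumed grid of $(n, p, \lambda)$ values, using the rounded constant $C_0 \approx 7.6821$; the helper names and the tolerance are illustrative.

```python
import math

C0 = 7.6821  # rounded value from Lemma 6

def lemma7_sides(n, lam, p):
    """Return (lhs_a, rhs_a, lhs_b, rhs_b) for Lemma 7 with q = (1 - p)^n."""
    q = (1.0 - p) ** n
    lhs_a = q * math.exp(lam * (p - p * q)) + (1.0 - q) * math.exp(-lam * p * q)
    lhs_b = q * math.exp(lam * (p * q - p)) + (1.0 - q) * math.exp(lam * p * q)
    rhs_a = math.exp(p * lam ** 2 / (4.0 * n))
    rhs_b = math.exp(p * lam ** 2 / (C0 * n))
    return lhs_a, rhs_a, lhs_b, rhs_b

def lemma7_holds(ns, ps, lams, rel_tol=1e-12):
    """True if (a) and (b) hold at every grid point, up to rounding slack."""
    for n in ns:
        for p in ps:
            for lam in lams:
                la, ra, lb, rb = lemma7_sides(n, lam, p)
                if la > ra * (1.0 + rel_tol) or lb > rb * (1.0 + rel_tol):
                    return False
    return True

ns = range(1, 21)
ps = [i / 20 for i in range(1, 20)]      # p = 0.05, ..., 0.95
lams = [k / 2 for k in range(0, 41)]     # lambda = 0.0, ..., 20.0
print(lemma7_holds(ns, ps, lams))
```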



Our proof of Theorems 1 and 2 is facilitated by the following observation, also made in [7].
Although the random variables $\xi_j$ whose weighted sum comprises the missing mass (4) are
not independent, they are negatively associated [2]. A basic fact about negative association is
that it is "at least as good as independence" as far as exponential concentration is concerned
[7, Lemmas 5-8]:

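Lemma 8. Let $\xi'_j$ be independent random variables, where $\xi'_j$ is distributed identically to $\xi_j$
for all $j \in \mathbb{N}$. Define also the "independent analogue" of $U_n$:

$$U'_n = \sum_{j\in\mathbb{N}} p_j\xi'_j.$$

Then for all $n \in \mathbb{N}$ and $\varepsilon > 0$,

(a)

$$\mathbb{P}(U_n > \mathbb{E}U_n + \varepsilon) \le \mathbb{P}(U'_n > \mathbb{E}U'_n + \varepsilon),$$

(b)

$$\mathbb{P}(U_n < \mathbb{E}U_n - \varepsilon) \le \mathbb{P}(U'_n < \mathbb{E}U'_n - \varepsilon).$$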

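Proof of Theorems 1 and 2. Observe that the random variables $\xi'_j$ defined in Lemma 8 have a
Bernoulli distribution with $\mathbb{P}(\xi'_j = 1) = q_j = (1-p_j)^n$ and put $X_j = \xi'_j - \mathbb{E}\xi'_j$. Using standard
exponential bounding with Markov's inequality and the independence of the $\xi'_j$,

$$\begin{aligned}
\mathbb{P}(U_n > \mathbb{E}U_n + \varepsilon) &\le \mathbb{P}(U'_n > \mathbb{E}U'_n + \varepsilon) \\
&= \mathbb{P}\left(\exp\left(\lambda\sum_{j\in\mathbb{N}} p_jX_j\right) > e^{\lambda\varepsilon}\right), \qquad \lambda > 0 \\
&\le e^{-\lambda\varepsilon}\prod_{j\in\mathbb{N}}\mathbb{E}e^{\lambda p_jX_j} \\
&= e^{-\lambda\varepsilon}\prod_{j\in\mathbb{N}}\left(q_je^{\lambda(p_j - p_jq_j)} + (1-q_j)e^{-\lambda p_jq_j}\right) \\
&\le e^{-\lambda\varepsilon}\prod_{j\in\mathbb{N}}\exp(p_j\lambda^2/4n) \\
&= \exp(\lambda^2/4n - \lambda\varepsilon),
\end{aligned}$$

where the last inequality invoked Lemma 7(a) and the final equality uses $\sum_j p_j = 1$. Choosing $\lambda = 2n\varepsilon$ yields Theorem 1.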

The proof of Theorem 2 is almost identical, except that $X_j$ is replaced by $-X_j$,
Lemma 7(b) is invoked instead of Lemma 7(a), and $\lambda = C_0 n\varepsilon/2$ is chosen. □
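The choice of $\lambda$ in both proofs is the minimizer of a quadratic exponent. The sketch below makes this explicit; the function names and the parameter `c` (equal to 4 for Theorem 1 and to $C_0$ for Theorem 2) are illustrative, and the rounded value of $C_0$ is an assumption.

```python
import math

C0 = 7.6821  # rounded value from Lemma 6

def chernoff_exponent(lam, n, eps, c=4.0):
    """Exponent lam^2 / (c n) - lam * eps obtained after applying Lemma 7."""
    return lam * lam / (c * n) - lam * eps

def best_lambda(n, eps, c=4.0):
    """Minimizer of the quadratic exponent: lam = c n eps / 2."""
    return c * n * eps / 2.0

n, eps = 50, 0.1
lam1 = best_lambda(n, eps)              # equals 2 n eps, as in the proof of Theorem 1
lam2 = best_lambda(n, eps, c=C0)        # equals C0 n eps / 2, for Theorem 2
print(chernoff_exponent(lam1, n, eps), -n * eps ** 2)
print(chernoff_exponent(lam2, n, eps, c=C0), -C0 * n * eps ** 2 / 4)
```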